Predicting Capital Bikeshare Departures

could be nice to have a pic at the start

knitr::include_graphics("map.png")

DC’s Capital Bikeshare service has been increasing in popularity, especially during the COVID-19 pandemic, when Washingtonians needed alternatives to public transport. To date it has 5,000 bikes and 600+ stations across 7 jurisdictions. However, users find the service unreliable, especially at peak times. In this project, our goal is to build and tune supervised machine learning models that predict the number of ride-sharing bikes that will be used in any given hour. For simplicity, we apply our models to a single station with particularly high demand: the Lincoln Memorial station. As such, our target variable is the number of bikes that departed from that station in a given hour.

We chose predictors that vary by hour and that we believe are relevant to individuals’ decisions to take a Capital Bikeshare bike. They are of three types:

Being able to predict Capital Bikeshare demand could result in a more efficient allocation of bikes when stations are re-stocked at night. It could also inform the eventual expansion of stations to strategic locations across the city, improving the experience of Washingtonians.

Data

We use Capital Bikeshare’s publicly available historical data, covering May 2020 to September 2021. After data cleaning, our dataset has 309197 rows (CONFIRM?), each indicating the number of bikes that departed from the Lincoln Memorial station in a given hour.
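To make the structure of the target variable concrete, here is a minimal sketch of how trip-level records could be collapsed into hourly departure counts for one station. The toy data and the column names `started_at` and `start_station_name` are assumptions made for illustration, not necessarily the names used in our cleaning functions.

```r
library(lubridate)

# Toy trip records; column names and values are illustrative assumptions.
trips <- data.frame(
  started_at = c("2021-07-04 09:05:00", "2021-07-04 09:40:00",
                 "2021-07-04 10:15:00", "2021-07-04 09:20:00"),
  start_station_name = c("Lincoln Memorial", "Lincoln Memorial",
                         "Lincoln Memorial", "Jefferson Memorial"),
  stringsAsFactors = FALSE
)

# Keep only departures from the station of interest.
lincoln <- trips[trips$start_station_name == "Lincoln Memorial", ]

# Parse the start time and truncate it to the start of the hour.
lincoln$hour_start <- floor_date(
  ymd_hms(lincoln$started_at, tz = "America/New_York"),
  unit = "hour"
)

# Count departures per hour: each trip contributes 1 to its hour.
lincoln$n <- 1L
hourly <- aggregate(n ~ hour_start, data = lincoln, FUN = sum)
names(hourly)[2] <- "departures"

print(hourly)
```

Hours with no departures do not appear in `hourly` under this approach, so they would need to be added back with a count of zero before modelling.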

For weather predictors we used data from ADD

For sunlight we used data from ADD

Data cleaning

The historic data initially included the following variables:

The data required significant cleaning before we could use it, which involved writing functions for the following purposes:

We also added weather and sunlight predictors in the following way:

* Cleaning the weather dataset with lubridate functions so that it can be merged on date and hour
* Removing variables with missing observations
* Creating a categorical variable for good, bad and “okay” weather
* To add sunlight predictors, we had to… [ADD - I DO NOT UNDERSTAND THIS FULLY]
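The weather steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual cleaning code: the column names (`datetime`, `temp_c`, `precip_mm`, `wind_gust`), the toy values, and the thresholds that define good, bad and “okay” weather are all hypothetical placeholders.

```r
library(lubridate)

# Toy hourly weather data; names and values are illustrative assumptions.
weather <- data.frame(
  datetime  = c("2021-07-04 09:00:00", "2021-07-04 10:00:00",
                "2021-07-04 11:00:00"),
  temp_c    = c(24, 31, 12),
  precip_mm = c(0, 0, 4.2),
  wind_gust = c(NA, 20, 15),
  stringsAsFactors = FALSE
)

# Toy hourly departures keyed by date and hour.
departures <- data.frame(
  date       = as.Date(c("2021-07-04", "2021-07-04", "2021-07-04")),
  hour       = c(9L, 10L, 11L),
  departures = c(12L, 30L, 3L)
)

# 1. Parse the timestamp and derive the merge keys with lubridate.
weather$datetime <- ymd_hms(weather$datetime, tz = "America/New_York")
weather$date <- as_date(weather$datetime)
weather$hour <- hour(weather$datetime)

# 2. Drop variables (columns) that contain any missing observation.
weather <- weather[, colSums(is.na(weather)) == 0, drop = FALSE]

# 3. Build a three-level weather category (thresholds are placeholders).
is_bad  <- weather$precip_mm > 1 | weather$temp_c < 5
is_good <- weather$precip_mm == 0 &
  weather$temp_c >= 18 & weather$temp_c <= 28
weather$weather_cat <- factor(
  ifelse(is_bad, "bad", ifelse(is_good, "good", "okay")),
  levels = c("bad", "okay", "good")
)

# 4. Merge weather onto departures by date and hour.
merged <- merge(
  departures,
  weather[, c("date", "hour", "temp_c", "precip_mm", "weather_cat")],
  by = c("date", "hour")
)

print(merged)
```

Dropping whole columns with any missing value is the simplest reading of "removing variables with missing observations"; a tolerance threshold or imputation would be alternatives if too many predictors are lost.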

The cleaned dataset includes the following variables:

[ADD]

Overall method

Here is where we explain why we used certain algorithms versus others - I think we are using parametric algorithms because we have a very wide dataset?